e-Chapter 6 


Similarity-Based Methods 


“It’s a manohorse”, exclaimed the confident little 
5 year old boy. We call it the Centaur out of habit, 
but who can fault the kid’s intuition? The 5 year old 
has never seen this thing before now, yet he came up 
with a reasonable classification for the beast. He is us- 
ing the simplest method of learning that we know of 
— similarity — and yet it’s effective: the child searches 
through his history for similar objects (in this case a 
man and a horse) and builds a classification based on 
these similar objects. 

The method is simple and intuitive, yet when we 
get into the details, several issues need to be addressed 
in order to arrive at a technique that is quantitative 
and fit for a computer. The goal of this chapter is to build exactly such a 
quantitative framework for similarity based learning. 





6.1 Similarity 


The ‘manohorse’ is interesting because it requires a deep understanding of 
similarity: first, to say that the Centaur is similar to both man and horse; 
and, second, to decide that there is enough similarity to both objects so that 
neither can be excluded, warranting a new class. A good measure of similarity 
allows us to not only classify objects using similar objects, but also detect the 
arrival of a new class of objects (novelty detection). 

A simple classification rule is to give a new input the class of the most 
similar input in your data. This is the ‘nearest neighbor’ rule. To implement 
the nearest neighbor rule, we need to first quantify the similarity between 
two objects. There are different ways to measure similarity, or equivalently 
dissimilarity. Consider the following example with 3 digits. 
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The two 9s should be regarded as very similar. Yet, if we naively measure 
similarity by the number of black pixels in common, the two 9s have only two 
in common. On the other hand, the 6 has many more black pixels in common 
with either 9, even though the 6 should be regarded as dissimilar to both 9s. 
Before measuring the similarity, one should preprocess the inputs, for example 
by centering, axis aligning and normalizing the size in the case of an image. 
One can go further and extract the relevant features of the data, for example 
size (number of black pixels) and symmetry as was done in Chapter 3. These 
practical considerations regarding the nature of the learning task, though im- 
portant, are not our primary focus here. We will assume that through domain 
expertise or otherwise, features have been constructed to identify the impor- 
tant dimensions, and if two inputs differ in these dimensions, then the inputs 
are likely to be dissimilar. Given this assumption, there are well established 
ways to measure similarity (or dissimilarity) in different contexts. 





6.1.1 Similarity Measures 


For inputs x,x’ which are vectors in Rf, we can measure dissimilarity using 
the standard Euclidean distance, 


d(x, x)= |x — x’. 


The smaller the distance, the more similar are the objects corresponding to 
inputs x and x’.! For Boolean features the Euclidean distance is the square 
root of the well known Hamming distance. The Euclidean distance is a special 
case of a more general distance measure which can be defined for an arbitrary 
positive semi-definite matrix Q: 


d(x, x’) = (x — x’)"Q(x — x’). 


A useful special case, known as the Mahalanobis distance is to set Q = £7}, 
where © is the covariance matrix ? of the data. The Mahalanobis distance 
metric depends on the data set. The main advantage of the Mahalanobis 
distance over the standard Euclidean distance is that it takes into account 
correlations among the data dimensions and scale. A similar effect can be 


1An axiomatic treatment of similarity that goes beyond metric-based similarity appeared 
in “Features of Similarity”, A. Tversky, Psychological Review, 84(4), pp 327-352, 1977. 


1k (eo! 
25 = — Xxix: — XX", where X = — Xi. 
Dye AR where aS ay Ds 
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accomplished by first using input preprocessing to standardize the data (see 
Chapter 9) and then using the standard Euclidean distance. Another useful 
measure, especially for Boolean vectors, is the cosine similarity, 
. F xex 
CosSim(x, x’) = ——_—.. 
I>" 

The cosine similarity is the cosine of the angle between the two vectors, 
CosSim € [—1,1], and larger values indicate greater similarity. When the 
objects represent sets, then the set similarity or Jaccard coefficient is often 
used. For example, consider two movies which have been watched by two dif- 
ferent sets of users S1, S2. We may measure how similar these movies are by 
how similar the two sets Sı and Sù are: 


— ($10 S], 


J(S1, 52) = 15, US| 


1— J(S1, S2) can be used as a measure of distance which conveniently has the 
properties that a metric formally satisfies, such as the triangle inequality. We 
focus on the Euclidean distance which is also a metric; many of the algorithms, 
however, apply to arbitrary similarity measures. 


Exercise 6.1 
(a) Give two vectors with very high cosine similarity but very low Euclidean 
distance similarity. Similarly, give two vectors with very low cosine 
similarity but very high Euclidean distance similarity. 


(b) If the origin of the coordinate system changes, which measure of 
similarity changes? How will this affect your choice of features? 


6.2 Nearest Neighbor 


Simple rules survive; and, the nearest neighbor technique is perhaps the sim- 
plest of all. We will summarize the entire algorithm in a short paragraph. 
But, before we forge ahead, let’s not forget the two basic competing principles 
laid out in the first five chapters: any learning technique should be expressive 
enough that we can fit the data and obtain low Ein; however, it should be 
reliable enough that a low Fin implies a low Eout. 

The nearest neighbor rule is embarrassingly simple. There is no training 
phase (or no ‘learning’ so to speak). The entire algorithm is specified by how 
one computes the final hypothesis g(x) on a test input x. Recall that the data 
set is D = (x1,y1).--(Xw, yn), Where yn = +1. To classify the test point x, 
find the nearest point to x in the data set (the nearest neighbor), and use the 
classification of this nearest neighbor. 

Formally speaking, reorder the data according to distance from x (breaking 
ties using the data point’s index for simplicity). We write (Xin) (X), Yin] (x)) 
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for the nth such reordered data point with respect to x. We will drop the 
dependence on x and simply write (X{n]; nj) when the context is clear. So, 


d(x, X{1)) < d(x, Xiz) L- < d(x, Xįn]) 
The final hypothesis is 
g(x) = yy (x) 


(x is classified by just looking at the label of the nearest data point to x). This 
simple nearest neighbor rule admits a nice geometric interpretation, shown for 
two dimensions in Figure 6.1. 

















Figure 6.1: Nearest neighbor Voronoi tessellation. 


The shading illustrates the final hypothesis g(x). Each data point x,, ‘owns’ a 
region defined by the points closer to x, than to any other data point. These 
regions are convex polytopes (convex regions whose faces are hyperplanes), 
some of which are unbounded; in two dimensions, we get convex polygons. 
The resulting set of regions defined by such a set of points is called a Voronoi 
(or Dirichlet) tessellation of the space. The final hypothesis g is a Voronoi 
tessellation with each region inheriting the class of the data point that owns 
the region. Figure 6.2(a) further illustrates the nearest neighbor classifier for 
a sample of 500 data points from the digits data described in Chapter 3. 

The clear advantage of the nearest neighbor rule is that it is simple and 
intuitive, easy to implement and there is no training. It is expressive, as it 
achieves zero in-sample error (as can immediately be deduced from Figure 6.1), 
and as we will soon see, it is reliable. A practically important cosmetic upside, 
when presenting to a client for example, is that the classification of a test object 
is easy to ‘explain’: just present the similar object on which the classification 
is based. The main disadvantage is the computational overhead. 
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VC dimension. The nearest neighbor rule fits within our standard super- 
vised learning framework with a very large hypothesis set. Any set of n labeled 
points induces a Voronoi tessellation with each Voronoi region assigned to a 
class; thus any set of n labeled points defines a hypothesis. Let Hn be the hy- 
pothesis set containing all hypotheses which result from some labeled Voronoi 
tessellation on n points. Let H = UX Hn be the union of all these hypothesis 
sets; this is the hypothesis set for the nearest neighbor rule. The learning 
algorithm picks the particular hypothesis from H which corresponds to the 
realized labeled data set; this hypothesis is in Hy C H. Since the training 
error is zero no matter what the size of the data set, the nearest neighbor rule 
is non-falsifiable 3 and the VC-dimension of this model is infinite. So, from 
the VC-worst-case analysis, this spells doom. A finite VC-dimension would 
have been great, and it would have given us one form of reliability, namely 
that Eout is close to Fin and so minimizing Ein works. The nearest neighbor 
method is reliable in another sense, and we are going to need some new tools 
if we are to argue the case. 


6.2.1 Nearest Neighbor is 2-Optimal 


Using a probabilistic argument, we will show that, under reasonable assump- 
tions, the nearest neighbor rule has an out-of-sample error that is at most 
twice the minimum possible out-of-sample error. The success of the nearest 
neighbor algorithm relies on the nearby point x;1)(x) having the same classi- 
fication as the test point x. This means two things: there is a ‘nearby’ point 
in the data set; and, the target function is reasonably smooth so that the 
classification of this nearest neighbor is indicative of the classification of the 
test point. 





We model the target value, which is +1, as noisy and define 
n(x) = Ply = +1]x]. 


A data pair (x,y) is obtained by first generating x from the input probability 
distribution P(x), and then y from the conditional distribution a(x). We 
can relate m(x) to a deterministic target function f(x) by observing that if 
T(x) > 4, then the optimal prediction is f(x) = +1; and, if r(x) < 4 then 
the optimal prediction is f(x) = —1.4 


3Recall that a hypothesis set is non-falsifiable if it can fit any data set. 
4When q(x) = $ we break the tie in favor of +1. 
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Exercise 6.2 


Let 


oe E if r(x) > 4, 


—1 otherwise. 


Show that the probability of error on a test point x is 


elf (x)) = PIF (x) # y] = min{x(x), 1 — m(x)} 


and e(f(x)) < e(h(x)) for any other hypothesis h (deterministic or not). 





The error e(f(x)) is the probability that y # f(x) on the test point x. The 
previous exercise shows that e(f(x)) = min{z(x),1— (x)}, which is the 
minimum probability of error possible on test point x. Thus, the best possible 
out-of-sample misclassification error is the expected value of e(f(x)), 











out = Ex[e(f(x))] = fax P(x) min{r(x), 1 — m(x)}. 





If we are to infer f(x) from a nearest neighbor using f(x;1)), then a(x) should 
not be varying too rapidly and x); should be close to x. We require that (x) 
should be continuous.” To ensure that there is a near enough neighbor X11) for 
any test point x, it will suffice that N be large enough. We now show that, 
if r(x) is continuous and N is large enough, then the simple nearest neighbor 
rule is reliable; specifically, it achieves an out-of-sample error which is at most 
twice the minimum achievable error (hence the title of this section). 

Let {Dy} be a sequence of data sets with sizes N = 1,2,... generated 
according to P(x,y) and let {gn} be the associated nearest neighbor decision 
rules for each data set. For N large enough, we will show that 

out (gn) < 2E 


out’ 


So, even though the nearest neighbor rule is non-falsifiable, the test error is at 
most a factor of 2 worse than optimal (in the asymptotic limit, i.e., for large 
N). The nearest neighbor rule is reliable; however, Ein, which is always zero, 
will never match Eout. We will never have a good theoretical estimate of Fout 
(so we will not know our final performance), but we know that you cannot do 
much better. The formal statement we can make is: 


Theorem 6.1. For any ô > 0, and any continuous noisy target m(x), there is 
a large enough N for which, with probability at least 1 — ô, Eout(gn) < 2E 


out’ 


The intuition behind the factor of 2 is that f(x) makes a mistake only if 
there is an error on the test point x, whereas the nearest neighbor classifier 
makes a mistake if there is an error on the test point or on the nearest neighbor 
(which adds up to the factor of 2). 


5Formally, it is only required that the assumptions hold on a set of probability 1. 
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More formally, consider a test point x. The nearest neighbor classifier 
makes an error on x if y = 1 and yp) = —1 or if y = —1 and ypj = 1. 


Plgn(x) Ay] = 


3 


(x) : (1 = z(xp))) + (1 = 10x) - a(x), 


where we defined n(x) = min{7(x),1 — m(x)} (the minimum probability of 
error on x), and en (X) = (2r (x) — 1) - (m(x) — m(xjp])). Observe that 


lew(x)| < r(x) = r(x), 


because 0 < m(x) < 1. To get Eout(gn), we take the expectation with respect 
to x and use Eu = Eln(x)] Ely(x)?] > Eln(x))? = Egu”: 


O 
















































































Eou (gn) = E Plgn(x) # yl] 
= 2E[n(x)] —2E [n°(x)] + Exlew(x)] 
<  2E ou. (1 — Bou) + Exlen (x)]- 











Under very general conditions, Ex[en(x)] — 0 in probability, which implies 
Theorem 6.1. We don’t give a formal proof, but sketch the intuition for why 
€n(x) > 0. When the data set gets very large (N — 00), every point x has 
a nearest neighbor that is close by. That is, x(x) —> x for all x. This is 
the case if, for example, P(x) has bounded support® as shown in Problem 6.9 
(more general settings are covered in the literature). By the continuity of m(x), 
if xp] > x then 7(xj1)) > a(x), and since Jen (x)| < |7(xf1)) — 7(x)], it follows 
that en (x) + 0.” The proof also highlights when the nearest neighbor works: 
a test point must have a nearby neighbor, and the noisy target m(x) should 
be slowly varying so that the y-value at the nearby neighbor is indicative of 
the y-value at x. 

When E>, is small, Eout(g) is at most twice as bad. When E%., is large, 
you are doomed anyway (but on the positive side, you will get a better factor 





than 2 ©). It is quite startling that as simple a rule as nearest neighbor is 
2-optimal. 


At least half the classification power of the data is in the nearest neighbor. 


The rate of convergence to 2-optimal depends on how smooth 7(x) is, but 
the convergence itself is never in doubt, relying only on the weak requirement 
that a(x) is continuous. The fact that you get a small Fout from a model 
that can fit any data set has led to the notion that the nearest neighbor rule 
is somehow ‘self-regularizing’. Recall that regularization guides the learning 


6x € supp(P) if for any e > 0, P[Be(x)] > 0 (Be(x) is the ball of radius € centered on x). 

7One technicality remains because we need to interchange the order of limit and expec- 
tation: limy Ex[e(x)] = Ex[limy—.o €(x)] = 0. For the technically inclined, this can be 
justified under mild measurability assumptions using dominated convergence. 
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toward simpler hypotheses, and helps to combat noise, especially when there 
are fewer data points. This self regularizing in the nearest neighbor algorithm 
occurs naturally because the final hypothesis is indeed simpler when there are 
fewer data points. The following examples of the nearest neighbor classifier 
with increasing number of data points illustrates how the boundary gets more 
complicated only as you increase N. 


NEN CR a 


N=2 N=3 = = = 





6.2.2 k-Nearest Neighbors (k-NN) 


By considering more than just a single neighbor, one obtains the generaliza- 
tion of the nearest neighbor rule to the k-nearest neighbor rule (k-NN). For 
simplicity, assume that k is odd. The k-NN rule classifies the test point x 
according to the majority class among the k nearest data points to x (the k 
nearest neighbors). The nearest neighbor rule is the 1-NN rule. For k > 1, 


the final hypothesis is: 
k 
x) = sign (>: Yia w) : 
i=1 


(k is odd and yn = +1). Figure 6.2 gives a comparison of the nearest neighbor 
rule with the 21-NN rule on the digits data (blue circles are the digit 1 and 
all other digits are the red x’s). When k is large, the hypothesis is ‘simpler’; 
in particular, when k = N the final hypothesis is a constant equal to the 
majority class in the data. The parameter k controls the complexity of the 
resulting hypothesis. Smaller values of k result in a more complex hypoth- 
esis, with lower in-sample-error. The nearest neighbor rule always achieves 
zero in-sample error. When k > 1 (see Figure 6.2(b)), the in-sample error 
is not necessarily zero and we observe that the complexity of the resulting 
classification boundary drops considerably with higher k. 





Three Neighbors is Enough. As with the simple nearest neighbor rule, 
one can analyze the performance of the k-NN rule (for large N). The next 
exercise guides the reader through this analysis. 
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ow ge San 
(a) 1-NN rule (b) 21-NN rule 











Figure 6.2: The 1-NN and 21-NN rules for classifying a random sample of 
500 digits (1 (blue circle) vs all other digits). Note, 21 ~ 500. For the 
1-NN rule, the in-sample error is zero, resulting in a complicated decision 
boundary with islands of red and blue regions. For the 21-NN rule, the 
in-sample error is not zero and the decision boundary is ‘simpler’. 


Exercise 6.3 


Fix an odd k > 1. For N = 1,2,... and data sets {Dy} of size N, let gn 
be the k-NN rule derived from Dy, with out-of-sample error Hout (gn). 


(a) Argue that Bout(gn) = Ex[Qz(n(x))] + Ex[en (x)] for some error 
term en (x) which converges to zero, and where 


i=0 
and 7(x) = min{7(x), 1 — 7(x)}. 
(b) Plot Q}(n) for n € [0, 5] and k = 1,3,5. 
(c) Show that for large enough N, with probability at least 1 — 6, 
k=3: — Eout(gn) < out + 3E [n°(x);] 
Or out (gn) S Heat +10E [n (x)]. 





(d) [Hard] Show that Eout(gx) is asymptotically E%,.(1 + O(k~1/?)). 
[Hint: Use your plot of Qk to argue that there is some a(k) such that 
Qr < (1+ a(k)), and show that the best such a(k) is O(1/Vk).] 


For k = 3, as N grows, one finds that the out-of-sample error is nearly optimal: 














out (gn) < E + 3 E [n? (x)]. 
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To interpret this result, suppose that (x) is approximately constant. Then, 
for k = 3, Eout(gn) < Exy,+3(E%,4)? (when N is large). The intuition here is 
that the nearest neighbor classifier makes a mistake if there is an error on the 
test x or if there are at least 2 mistakes on the three nearest neighbors. The 
probability of two errors is O(n?) which is dominated by the probability of 
error on the test point. The 3-NN rule is approximately optimal, converging 
to Fey, as E34, > 0. 

To illustrate further, suppose the optimal classifier gives E*,, = 1%. Then, 
for large enough N, 3-NN delivers Eout of at most 1.03%, which is essentially 
optimal. The exercise also shows that going to 5-NN improves the asymptotic 
performance to 1.001%, which is certainly closer to optimal than 3-NN, but 
hardly worth the effort. Thus, a useful practical guideline is: k = 3 nearest 
neighbors is often enough. 


Approximation Versus Generalization. The parameter k controls the 
approximation-generalization trade off. Choosing k too small leads to too 
complex a decision rule, which overfits the data; on the other hand, k too 
large leads to underfitting. Though k = 3 works well when E%,, is small, for 
general situations we need to choose between the different k: each value of k 
is a different model. We need to pick the best value of k (the best model). 
That is, we need model selection. 


Exercise 6.4 


Consider the task of selecting a nearest neighbor rule. What’s wrong with 
the following logic applied to selecting k? (Limits are as N — co.) 
Consider the hypothesis set Hnn with N hypotheses, the k-NN 
rules using k = 1,...,.N. Use the in-sample error to choose a 
value of k which minimizes Ein. Using the generalization error 
bound in Equation (2.1), conclude that Ein — Hout because 
log N/N — 0. Hence conclude that asymptotically, we will be 
picking the best value of k, based on Fin alone. 


[Hints: What value of k will be picked? What will Ein be? Does your 
‘hypothesis set’ depend on the data?] 


Fortunately, there is powerful theory which gives a guideline for selecting k as 
a function of N; let k(N) be the choice of k, which is chosen to grow as N 
grows. Note that 1 < k(N) < N. 


Theorem 6.2. For N > ov, if k(N) > œ and k(N)/N — 0 then, 
Fin(g) > Eout(g) and  Eout(g) > Eo 


out’ 


Theorem 6.2 holds under smoothness assumptions on 7(x) and regularity 
conditions on P(x) which are very general. The theorem sets k(N) as a func- 
tion of N and states that as N — oo, we can recover the optimal classifier. 
One good choice for k(N) is | VN]. 
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We sketch the intuition for Theorem 6.2. For a test point x, consider the 
distance r to its k-th nearest neighbor. The fraction of points which fall within 
distance r of x is k/N which, by assumption, is approaching zero. The only 
way that the fraction of points within r of x goes to zero is if r itself tends to 
zero. That is, all the k nearest neighbors to x are approaching x. Thus, the 
fraction of these k neighbors with y = +1 is approximating a(x) and as k — oo 
this fraction will approach 7(x) by the law of large numbers. The majority vote 
among these k neighbors will therefore be implementing the optimal classifier f 
from Exercise 6.2, and hence we will obtain the optimal error E3,,.. We can 
also directly identify the roles played by the two assumptions in the statement 
of the theorem. The k/N — 0 condition is to ensure that all the k nearest 
neighbors are close to x; the k — oo is to ensure that there are enough close 
neighbors so that the majority vote is almost always the optimal decision. 

Few practitioners would be willing to blindly set k = |VN]. They would 
much rather the data spoke for itself. Validation, or cross validation (Chap- 
ter 4) are good ways to select k. Validation to select a nearest neighbor 
classifier would run as follows. Randomly divide the data D into a training set 
Dirain and a validation set Dya] of sizes N — K and K respectively (the size 
of the validation set K is not to be confused with k, the number of nearest 
neighbors to use). Let Htrain = {91, - --, gy- g} be N—K hypotheses’ defined 
by the k-NN rules obtained from Dirain, for k = 1,..., N—K. Now use Dyal to 
select one hypothesis g` from Htrain (the one with minimum validation error). 
Similarly, one could use cross validation instead of validation. 


Exercise 6.5 


Consider using validation to select a nearest neighbor rule (hypothesis g` 
from Htrain). Let gz be the hypothesis in Herain with minimum Eout. 


(a) Show that if K/log(N — K) — oo then validation chooses a good 
hypothesis, Hout(g”) ~ Eout(gz). Formally state such a result and 
show it to be true. [Hint: The statement has to be probabilistic; 
use the Hoeffding bound and the fact that choosing g amounts to 
selecting a hypothesis from among N — K using a data set of size 
K] 

(b) If also N — K — on, then show that Eout(g) —> Euas (validation 


results in near optimal performance). [Hint: Use (a) together with 
Theorem 6.2 which shows that some value of k is good.] 


Note that the selected g is not a nearest neighbor rule on the full data D; 
it is a nearest neighbor rule using data Dtrain, and K~ neighbors. Would the 
performance improve if we used the &-NN rule on the full data set D? 


The previous exercise shows that validation works if the validation set size is 
appropriately chosen, for example, proportional to N. Figure 6.3 illustrates 


8The g` notation was introduced in Chapter 4 to denote a hypothesis built from the data 
set that is missing some data points. 
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(a) In-sample error of k-NN rule (b) Out-of-sample error of k-NN rule 


Figure 6.3: In and out-of-sample errors for various choices of k for the 1 
vs all other digits classification task. (a) shows the in-sample error; (b) 
the out-of-sample error. N points are randomly selected for training, and 
the remaining for computing the test error. The performance is averaged 
over 10,000 such training-testing splits. Observe that the 3-NN is almost 
as good as anything; the in and out-of sample errors for k = N 1/2 are a 
close match, as expected from Theorem 6.2. The cross validation in-sample 
error is approaching the out-of-sample error. For k = 1 the in-sample error 
is zero, and is not shown. 


our main conclusions on the digits data using k = |V N], k chosen using 
10-fold cross validation and k fixed at 1,3. 


6.2.3 Improving the Efficiency of Nearest Neighbor 


The simplicity of the nearest neighbor rule is both a virtue and a vice. There’s 
no training cost, but we pay for this when predicting on a test input during 
deployment. For a test point x, we need to compute the k nearest neighbors. 
This means we need access to all the data, a memory requirement of Nd 
and a computational complexity of O(Nd + N log k) (using appropriate data 
structures to compute N distances and pick the smallest k). This can be 
excessive as the next exercise illustrates. 


Exercise 6.6 


We want to select the value of k = 1,3,5,..., 2| Xo} | — 1 using 10-fold 
cross validation. Show that the running time is O(N°d + N? log N) 


The basic approach to addressing these issues is to preprocess the data in some 
way. This can help by either storing the data in a specialized data structure so 
that the nearest neighbor queries are more efficient, or by reducing the amount 
of data that needs to be kept (which helps with the memory and computational 
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requirements when deploying). These are active research areas, and we discuss 
the basic approaches here. 


Data Condensing The goal of data condensing is to reduce the data set 
size while making sure the retained data set somehow matches the original full 
data set. We denote the condensed data set by S. We will only consider the 
case when S is a subset of D.’ 

Fix k, and for the data set S, let gs be the k-NN rule derived from S and, 
similarly, gp is the k-NN rule derived from the full data set D. The condensed 
data set S is consistent with D if 


gs(x) =gp(x) Vee R*. 


This is a strong requirement, and there are computationally expensive algo- 
rithms which can achieve this (see Problem 6.11). A weaker requirement is 
that the condensed data only be training set consistent, which means that the 
condensed data and the full training data give the same classifications on the 
training data points, 


9s(Xn) = 9p (Xn) Yxn E€ D. 


This means that, as far as the training points are concerned, the rule derived 
from S is the same as the rule derived from D . Note that training set con- 
sistent does not mean Fin(gs) = 0. For starters, when k > 1, even the full 
data D may not achieve Ein (gp) = 0. 

The condensed nearest neighbor algo- 
rithm (CNN) is a very simple heuristic 
that results in a training set consistent x 
subset of the data. We illustrate with a 
data set of 8 points on the right, setting k 
to 3. Initialize the working set S randomly E 
with k data points from D (the boxed 
points illustrated on the right). As long iX: 
as S is not training set consistent, aug- = 
ment S with a data point not in S as fol- 
lows. Let x, be any data point for which O Lame 
gs(X«) # gp(x«), such as the solid blue 
circle in the example. This solid blue point is classified blue by S (in fact, 
every point is classified blue by S because S only contains three points, two of 
which are blue); but, it is classified red by D (because two of its three nearest 
points in D are red). So gs(x«) # gp(x.). Let x’ be the nearest data point to 
x, which is not in S and has class gp(x.). Note that x, could be in S. The 
reader can easily verify that such an x’ must exist because x, is inconsistently 
classified by S. Add x’ to S. Continue adding to S in this way until S is 
training set consistent. (Note that D is training set consistent, by definition.) 


add this point 


°More generally, if S contains arbitrary points, S is called a condensed set of prototypes. 
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Figure 6.4: The condensed data and resulting classifiers from running the 
CNN heuristic to condense the 500 digits data points used in Figure 6.2. 
(a) Condensation to 27 data points for 1-NN training set consistent (b) 
Condensation to 40 data points for 21-NN training set consistent. 


The CNN algorithm must terminate after at most N steps, and when it 
terminates, the resulting set S must be training set consistent. The CNN al- 
gorithm delivers good condensation in practice as illustrated by the condensed 
data in Figure 6.4 as compared with the full data in Figure 6.2. 


Exercise 6.7 


Show the following properties of the CNN heuristic. Let S be the current 
set of points selected by the heuristic. 


(a) If S is not training set consistent, and if x, is a point which is not 
training set consistent, show that the CNN heuristic will always find 
a point to add to S. 


(b) Show that the point added will ‘help’ with the classification of x, 
by S; it suffices to show that the new k nearest neighbors to x. in S 
will contain the point added. 


(c) Show that after at most N — k iterations the CNN heuristic must 
terminate with a training set consistent S. 


The CNN algorithm will generally not produce a training consistent set of 
minimum cardinality. Further, the condensed set can vary depending on the 
order in which data points are considered, and the process is quite computa- 
tionally expensive; but, at least it is done only once, and it saves every time a 
new point is to be classified. 

We should distinguish data condensing from data editing. The former has a 
purely computational motive, to reduce the size of the data set while keeping 
the decision boundary nearly unchanged. The latter edits the data set to 
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improve prediction performance. This typically means remove the points that 
you think are noisy (see Chapter 9). One way to determine if a point is noisy is 
if it disagrees with the majority class in its k-neighborhood. Such data points 
are removed from the data and the remaining data are used with the 1-NN 
rule. A very similar effect can be accomplished by just using the k-NN rule 
with the original data. 


Efficient Search for the Nearest Neighbors The other main approach 
to alleviating the computational burden is 
to preprocess the data in some way so that 
nearest neighbors can be found quickly. 
There is a large and growing volume of 
literature in this arena. Our goal here is 
to describe the basic idea with a simple 
technique. 

Suppose we want to search for the 
nearest neighbor to the point x (red). As- 
sume that the data is partitioned into clus- 
ters (or branches). In our illustration, we 
have two clusters Sı and S2. Each cluster 
has a ‘center’ uj and a radius rj. First 
search the cluster whose center is clos- 
est to the query point x. In the exam- 
ple, we search cluster S$; because ||x — p11|| < ||x — all. Suppose that the 
nearest neighbor to x in Sı is Xj), which is potentially xj), the nearest 
point to x in all the data. For any x, € S2, by the triangle inequality, 
|x —Xp|| > ||x — wall — r2. Thus, if the bound condition 





IX = ål] < [lx = mall = r2 


holds, then we are done (we don’t need to search S2), and can claim that $j) 
is a nearest neighbor (xq) = Xj). We gain a computational saving if the 
bound condition holds. This approach is called a branch and bound technique 
for finding the nearest neighbor. Let’s informally consider when we are likely 
to satisfy the bound condition, and hence obtain computational saving. By 
the triangle inequality, |x — Xfj|| < |x — ml] +71, so, if |x- mll +ri < 
|x — pt2|| — r2, then the bound condition is certainly satisfied. To interpret 
this sufficient condition, consider the case when x is much closer to p1, in 
which case ||x — p41 || ~ 0 and ||x — pall ~ ||u1 — pta||. Then we want 


mtr < || ua = Holl 


to ensure the bound holds. In words, the radii of the clusters (the within 
cluster spread) should be much smaller than the distance between clusters 
(the between cluster spread). A good clustering algorithm should produce ex- 
actly such clusters, with small within cluster spread and large between cluster 
spread. 
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There is nothing to prevent us from applying the branch and bound tech- 
nique recursively within each cluster. So, for example, to search for the nearest 
neighbor inside S1, we can assume that Sı has been partitioned into (say) two 
clusters S11 and S12, and so on. 


Exercise 6.8 


(a) Give an algorithmic pseudo code for the recursive branch and bound 
search for the nearest neighbor, assuming that every cluster with more 
than 2 points is split into 2 sub-clusters. 


(b) Assume the sub-clusters of a cluster are balanced, i.e. contain exactly 
half the points in the cluster. What is the maximum depth of the 
recursive search for a nearest neighbor. (Assume the number of data 
points is a power of 2). 


(c) Assume balanced sub-clusters and that the bound condition always 
holds. Show that the time to find the nearest neighbor is O(d log N). 


(d) How would you apply the branch and bound technique to finding the 
k-nearest neighbors as opposed to the single nearest neighbor? 


The computational savings of the branch and bound algorithm depend on 
how likely it is that the bound condition is met, which depends on how good 
the partitioning is. We observe that the clusters should be balanced (so that 
the recursion depth is not large), well separated and have small radii (so the 
bound condition will often hold). A very simple heuristic for getting such 
a good partition into M clusters is based on a clustering algorithm we will 
discuss later. Here is the main idea. 


First create a set of M well separated 
centers for the clusters; we recommend a 
simple greedy approach. Start by taking 
an arbitrary data point as a center. Now, 
as a second center, add the point furthest 
from the first center; to get each new cen- 
ter, keep adding the point furthest from 
the current set of centers until you have 
M centers (the distance of a point from 
a set is the distance to its nearest neigh- 
bor in the set). In the example, the red 
point is the first (random) center added; 
the largest blue point is added next and 
so on. This greedy algorithm can be im- 
plemented in O(M Nd) time using appropriate data structures. The partition 
is inferred from the Voronoi regions of each center: all points in a center’s 
Voronoi region belong to that center’s cluster. The center of each cluster is re- 
defined as the average data point in the cluster, and its radius is the maximum 








© M Abu-Mostafa, Magdon-Ismail, Lin: Jan-2015 e-Chap:6—16 











e-6. SIMILARITY-BASED METHODS 6.2. NEAREST NEIGHBOR 


distance from the center to a point in the cluster, 
1 . 
w= D si ry = mag [en = ml 


We can improve the clusters by obtain- 
ing new Voronoi regions defined by these 
new centers uj, and then updating the 
centers and radii appropriately for these 
new clusters. Naturally, we can continue 
this process iteratively (we introduce this 
algorithm later in the chapter as Lloyd’s 
algorithm for clustering). The main goal 
here is computational efficiency, and the 
perfect partition is not necessary. The 
partition, centers and radii after one re- 
finement are illustrated on the right. Note 
that though the clusters are disjoint, the 
spheres enclosing the clusters need not be. 

The clusters, together with u; and r; can be computed in O(M Nd) time. 
If the depth of the partitioning is O(log N), which is the case if the clusters 
are nearly balanced, then the total run time to compute all the clusters at 
every level is O(MNdlog N). In Problem 6.16 you can experiment with the 
performance gains from such simple branch and bound techniques. 





6.2.4 Nearest Neighbor is Nonparametric 


There is some debate in the literature on parametric versus nonparametric 
learning models. Having covered linear models, and nearest neighbor, we are 
in a position to articulate some of this discussion. We begin by identifying some 
of the defining characteristics of parametric versus nonparametric models. 

Nonparametric methods have no explicit parameters to be learned from 
data; they are, in some sense, general and expressive or flexible; and, they typi- 
cally store the training data, which are then used during deployment to predict 
on a test point. Parametric methods, on the other hand, explicitly learn cer- 
tain parameters from data. These parameters specify the final hypothesis from 
a typically ‘restrictive’ parameterized functional form; the learned parameters 
are sufficient for prediction, i.e. the parameters summarize the data which is 
then not needed later during deployment. 

To further highlight the distinction between these two types of techniques, 
consider the nearest neighbor rule versus the linear model. The nearest neigh- 
bor rule is the prototypical nonparametric method. There are no parameters 
to be learned. Yet, the nonparametric nearest neighbor method is flexible and 
general: the final decision rule can be very complex, or not, molding its shape 
to the data. The linear model, which we saw in Chapter 3 is the prototypical 
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parametric method. In d dimensions, there are d+ 1 parameters (the weights 
and bias) which need to be learned from the data. These parameters sum- 
marize the data, and after learning, the data can be discarded: prediction 
only needs the learned parameters. The linear model is quite rigid, in that no 
matter what the data, you will always get a linear separator as g. Figure 6.5 
illustrates this difference in flexibility on a toy example. 





(a) Nonparametric NN (b) Parametric linear 


Figure 6.5: The decision boundary of the ‘flexible’ nonparametric nearest 
neighbor rule molds itself to the data, whereas the ‘rigid’ parametric linear 
model will always give a linear separator. 


The k-nearest neighbor method would also be considered nonparametric 
(once the ‘parameter’ k has been specified). Theorem 6.2 is an example of 
a general convergence result.'° Under mild regularity conditions, no matter 
what the target f is, we can recover it as N — co, provided that k is chosen 
appropriately. That’s quite a powerful statement about such a simple learning 
model applied to learning a general f. Such convergence results under mild 
assumptions on f are a trademark of nonparametric methods. This has led 
to the folklore that nonparametric methods are, in some sense, more powerful 
than their cousins the parametric methods: for the parametric linear model, 
only if the target f happens to be in the linearly parameterized hypothesis 
set, can one get such convergence to f with larger N. 

To complicate the distinction between the two methods, let’s look at the 
non-linear feature transform (e.g. the polynomial transform). As the poly- 
nomial order increases, the number of parameters to be estimated increases 
and H, the hypothesis set, gets more and more expressive. It is possible to 
choose the polynomial order to increase with N, but not too quickly so that H 
gets more and more expressive, eventually capturing any fixed target and yet 


10For the technically oriented, it establishes the universal consistency of the k-NN rule. 
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the number of parameters grows slowly enough that it is possible to effectively 
learn them from the data. Such a model has properties of both the non- 
parametric and parametric settings: you still need to learn parameters (like 
parametric models) but the hypothesis set is getting more and more expressive 
so that you do get convergence to f. As a result, such techniques have some- 
times been called parametric, and sometimes nonparametric, neither being a 
satisfactory adjective; we prefer the term semi-parametric. Many techniques 
are semi-parametric, for example Neural Networks (see Chapter 7). 

In applications, the purely parametric model (like the linear model) has 
become synonymous with ‘specialized’ in that you have thought very carefully 
about the features and the form of the model and you are confident that the 
specialized linear model will work. To take another example from interest 
rate prediction, a well known specialized parametric model is the multi-factor 
Vasicek model. If we had no idea what the appropriate model is for interest 
rate prediction, then rather than commit to some specialized model, one might 
use a ‘generic’ nonparametric or semi-parametric model such as k-NN or neural 
networks. In this setting, the use of the more generic model, by being more 
flexible, accommodates our inability to pin down the specific form of the target 
function. We stress that the availability of ‘generic’ nonparametric and semi- 
parametric models is not a license to stop thinking about the right features 
and form of the model; it is just a backstop that is available when the target 
is not easy to explicitly model parametrically. 


6.2.5 Multiclass Data 


The nearest neighbor method is an elegant platform to study an issue which, 
unto now, we have ignored. In our digits data, there are not 2 classes (+1), but 
10 classes (0,1,...,9). This is a multiclass problem. Several approaches to the 
multiclass problem decompose it into a sequence of 2 class problems. For exam- 
ple, first classify between ‘1’ and ‘not 1’; if you classify as ‘1’ then you are done. 
Otherwise determine ‘2’ versus ‘not 2’; 
this process continues until you have nar- 
rowed down the digit. Several complica- 
tions arise in determining what is the best 
sequence of 2 class problems; and, they 
need not be one versus all (they can be 
any subset versus its complement). 








The nearest neighbor method offers a 
simple solution. The class of the test point 
x is simply the class of its nearest neigh- 
bor xj). One can also generalize to the k- 
nearest neighbor rule: the class of x is the 
majority class among the k-nearest neigh- 
bors X{1]; - - <, Xa], breaking ties randomly. 
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Figure 6.6: 21-NN decision classifier for the multiclass digits data. 


Exercise 6.9 


With C classes labeled 1,...,C, define 7-(x) = P{c|x] (the probability to 
observe class c given x, analogous to 7(x)). Let n(x) = 1 — max, 7¢(x). 


(a) Define a target f(x) = argmax, 7-(x). Show that, on a test point x, 
f attains the minimum possible error probability of 
e(f(x)) = PIF (x) # y] = n(x). 
(b) Show that for the nearest neighbor rule (k = 1), with high probability, 
the final hypothesis gn achieves an error on the test point x that is 


Cc 


e(gn(x)) =? $ me(x)(1 — me(x)). 


e= 


z 


(c) Hence, show that for large enough N, with high probability, 


C 


Eou(gN) < 2E out — Gay E). 
Hint: Show that Y`, a? > a? + E for any a > -> ac >20 
i C-1 y 


and aa) 


The previous exercise shows that even for the multiclass problem, the simple 
nearest neighbor is at most a factor of 2 from optimal. One can also obtain 
a universal consistency result as in Theorem 6.2. The result of running a 
21-nearest neighbor rule on the digits data is illustrated in Figure 6.6. 


The Confusion Matrix. The probability of misclassification which we dis- 
cussed in the 2-class problem can be generalized to a confusion matriz which 
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Table 6.1: Out-of-sample confusion matrix for the 21-NN rule on the 10- 
class digits problem (all numbers are in %). The bold entries along the 
diagonal are the correct classifications. Practitioners sometimes work with 
the class conditional confusion matrix which is P[c2|c1], the probability that 
g classifies c2 given that the true class is c;. The class conditional confusion 
matrix is obtained from the confusion matrix by dividing each row by the 
sum of the elements in the row. Sometimes the class conditional confusion 
matrix is easier to interpret because the ideal case is a diagonal matrix with 
all diagonals equal to 100%. 


is a better representation of the performance, especially when there are mul- 
tiple classes. The confusion matrix tells us not only the probability of error, 
but also the nature of the error, namely which classes are confused with each 
other. Analogous to the misclassification error, one has the in-sample and out- 
of-sample confusion matrices. The in-sample (resp. out-of-sample) confusion 
matrix is a C x C matrix which measures how often one class is classified as 
another. So, the out-of-sample confusion matrix is 





Eout(c1,€2) = P{class cı is classified as c2 by g] 


Ox [Ta (x) g(x) = call] 
fax P(x)re (x) lg(x) = c2] . 














where [-] is the indicator function which equals 1 when its argument is true. 
Similarly, we can define the in-sample confusion matrix, 


En(ere)= 5 D (9m) = ol; 


MYn=C1 


where the sum is over all data points with classification c1, and the summand 
counts the number of those cases that g classifies as c2. The out-of-sample 
confusion matrix obtained by running the 21-NN rule on the 10-class digits 
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data is shown in Table 6.1. The sum of the diagonal elements gives the prob- 
ability of correct classification, which is about 42%. From Table 6.1, we can 
easily identify some of the digits which are commonly confused, for example 8 
is often classified as 0. In comparison, for the two class problem of classifying 
digit 1 versus all others, the success was upwards of 98%. For the multiclass 
problem, random performance would achieve a success rate of 10%, so the per- 
formance is significantly above random; however, it is significantly below the 
2-class version; multiclass problems are typically much harder than the two 
class problem. Though symmetry and intensity are two features that have 
enough information to distinguish between 1 versus the other digits, they are 
not powerful enough to solve the multiclass problem. Better features would 
certainly help. One can also gain by breaking up the problem into a sequence 
of 2-class tasks and tailoring the features to each 2-class problem using domain 
knowledge. 


6.2.6 Nearest Neighbor for Regression. 


With multiclass data, the natural way to combine the classes of the nearest 
neighbors to obtain the class of a test input is by using some form of majority 
voting procedure. When the output is a real number (y € R), the natural way 
to combine the outputs of the nearest neighbors is using some form of averag- 
ing. The simplest way to extend the k-nearest neighbor algorithm to regression 
is to take the average of the target values from the k-nearest neighbors: 


x| = 


g(x) = 


k 
3 ya (Xx). 


There are no explicit parameters being learned, and so this is a nonparametric 
regression technique. Figure 6.7 illustrates the k-NN technique for regression 
using a toy data set generated by the target function in light gray. 





Figure 6.7: k-NN for regression on a noisy toy data set with N = 20. 
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Exercise 6.10 
You are using k-NN for regression (Figure 6.7). 
(a) Show that Ein is zero when k = 1. 
(b) Why is the final hypothesis not smooth, making step-like jumps? 


(c) How does the choice of k control how well g approximates f? (Con- 
sider the cases k too small and k too large.) 





(d) How does the final hypothesis g behave when x — oo. 


Outputting a Probability with k-NN. Recall that for binary classifica- 
tion, we took sign(g(x)) as the final hypothesis. We can also get a natural 
extension of the nearest neighbor technique outputting a probability. The final 
hypothesis is 


which is just the fraction of the k nearest neighbors that are classified +1. 


For both regression and logistic regression, as with classification, if N — oo 
and k — oo with k/N — 0, then, under mild assumptions, you will recover a 
close approximation to the target function: g(x) — f(x) (as N grows). This 
convergence is true even if the data (y values) are noisy, with bounded noise 
variance, because the averaging can overcome that noise when k gets large. 


6.3 Radial Basis Functions 


In the k-nearest neighbor rule, only k neighbors influence the prediction at a 
test point x. Rather than fix the value of k, the radial basis function (RBF) 
technique allows all data points to contribute to g(x), but not equally. Data 
points further from x contribute less; a radial basis function (or kernel) ¢ 
quantifies how the contribution decays as the distance increases. The contri- 
bution of x, to the classification at x will be proportional to (||x — x,||), and 
so the properties of a typical kernel are that it is positive and non-increasing. 
The most popular kernel is the Gaussian kernel, 


Another common kernel is the window kernel, 
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0 zg>1, 


TEF B 


It is common to normalize the kernel so that it integrates to 1 over R. For 
the Gaussian kernel, the normalization constant is (27)~¢/?; for the window 
kernel, the normalization constant is t~?/?P'($d+1), where T (-) is the Gamma 
function. The normalization constant is not important for RBFs, but for 
density estimation (see later), the normalization will be needed. 

One final ingredient is the scale r which specifies the width of the kernel. 
The scale determines the ‘unit’ of length against which distances are measured. 
If |x — x,|| is small relative to r, then xn will have a significant influence on 
the value of g(x). We denote the influence of x, at x by an(x), 


an(x) = ¢ (==) l 


T 


The ‘radial’ in RBF is because the influence only depends on the radial distance 
|x — xn ||. We compute g(x) as the weighted sum of the y-values: 


Ei On (x)yn 
E1 am(x) 


where the sum in the denominator ensures that the sum of the weights is 1. The 
hypothesis (6.1) is the direct analog of the nearest neighbor rule for regression. 
For classification, the final output is sign(g(x)); and, for logistic regression the 
output is 0(g(x)), where 6(s) = e*/(1 + e°). 

The scale parameter r determines the relative emphasis on nearby points. 
The smaller the scale parameter, the more emphasis is placed on the nearest 
points. Figure 6.8 illustrates how the final hypothesis depends on r and the 
next exercise shows that when r — 0, the RBF is the nearest neighbor rule. 


g(x) = (6.1) 


Exercise 6.11 


When r —> 0, show that for the Gaussian kernel, the RBF final hypothesis 
is g(x) = yj], the same as the nearest neighbor rule. 


N 
È Yenj Un /O] 
[Hint: g(x) = = and show that ajn/ap) > 0 forn £1.] 


Om /Q1] 
il 


Il 
iM=|" 


As r gets large, the kernel width gets larger and more of the data points 
contribute to the value of g(x). As a result, the final hypothesis gets smoother. 
The scale parameter r plays a similar role to the number of neighbors k in the 
k-NN rule; in fact, the RBF technique with the window kernel is sometimes 
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Figure 6.8: The nonparametric RBF using the same noisy data from Fig- 
ure 6.7. Too small an r results in a complex hypothesis that overfits. Too 
large an r results in an excessively smooth hypothesis that underfits. 


called the r-nearest neighbor rule, because it uniformly weights all neighbors 
within distance r of x. One way to choose r is using cross validation. Another 
is to use Theorem 6.2 as a guide. Roughly speaking, the neighbors within 
distance r of x contribute to g(x). The volume of this region is order of r@, 
and so the number of points in this region is order of Nr“; thus, using scale 
parameter r effectively corresponds to using a number of neighbors k that is 
order of Nrt. Since we want k/N — oo and k — œo as N grows, one effective 
choice for r is ( YN) E, which corresponds to k ~ VN. 


6.3.1 Radial Basis Function Networks 


There are two ways to interpret the RBF in equation (6.1). The first is as 
a weighted sum of y-values as presented in equation (6.1), where the weights 
are Q,. This corresponds to centering a single kernel, or a bump, at x. This 
bump decays as you move away from x and the value the bump attains at 
Xn determines the weight an (see Figure 6.9(a)). The alternative view is to 
rewrite Equation (6.1) as 


= ues (==, (6.2) 


where wn (xX) = Yn/ ae ¢ (||x — x;||/r). This version corresponds to center- 
ing a bump at every xn. The bump at x, has a height w,,(x). As you move 
away from Xn, the bump decays, and the value the bump attains at x is the 
contribution of xn to g(x) (see Figure 6.9(b)). 

With the second interpretation, g(x) is the sum of N bumps of different 
heights, where each bump is centered on a data point. The width of each 
bump is determined by r. The height of the bump centered on x, is wn(x), 
which varies depending on the test point. If we fix the bump heights to wn, 
independent of the test point, then the same bumps centered on the data 
points apply to every test point x and the functional form simplifies. Define 
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> > 
X (Xn, Yn) x (Xn, Yn) 
(a) Bump centered on x (b) Bumps centered on xn, 


Figure 6.9: Different views of the RBF. (a) g(x) is a weighted sum of yn 
with weights a, determined by a bump centered on x. (b) g(x) is a sum of 
bumps, one on each xn having height wn. 


®,,(x) = (|x — Xn||/r). Then our hypotheses have the form 


N 
h(x) = X` waGn (x), (6.3) 


n=1 


where now wpn are constants (parameters) that we get to choose. We switched 
over to h(x) because as of yet, we do not know what the final hypothesis g is 
until we specify the weights w1,..., wmn, whereas in (6.2), the weights wn (x) 
are specified. Observe that (6.3) is just a linear model h(x) = w'z with the 
nonlinear transform to an N-dimensional space given by 


Pı (x) 


y(x) 


The nonlinear transform is determined by the kernel and the data points 
X1,---,Xn- Since the nonlinear transform is obtained by placing a kernel 
bump on each data point, these nonlinear transforms are often called local 
basis functions. Notice that the nonlinear transform is not known ahead of 
time. It is only known once the data points are seen. 

In the nonparametric RBF, the weights w,,(x) were specified by the data 
and there was nothing to learn. Now that wn are parameters, we need to learn 
their values (the new technique is parametric). As we did with linear models, 
the natural way to set the parameters wp is to fit the data by minimizing the 
in-sample error. Since we have N parameters, we should be able to fit the 
data exactly. Figure 6.10 illustrates the parametric RBF as compared to the 
nonparametric RBF, highlighting some of the differences between taking g(x) 
as the weighted sum of y-values scaled by a bump at x versus the sum of fixed 
bumps at each data point xp. 
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T 


(a) Nonparametric RBF, small r (b) Nonparametric RBF, large r 








T 


(c) Parametric RBF, small r (b) Parametric RBF, large r 





Figure 6.10: (a),(b) The nonparametric RBF; (c),(d) The parametric RBF. 
The toy data set has 3 points. The width r controls the smoothness of the 
hypothesis. The nonparametric RBF is step-like with large |x| behavior 
determined by the target values at the peripheral data; the parametric RBF 
is bump-like with large x behavior that converges to zero. 


Exercise 6.12 


(a) For the Gaussian kernel, what is g(x) as ||x|| —> oo for the nonpara- 
metric RBF versus for the parametric RBF with fixed wn? 

(b) Let Z be the square feature matrix defined by Znj = ®;(xn). Assume 
Z is invertible. Show that g(x) = w™®(x), with w = Z 'y exactly 
interpolates the data points. That is, g(xn) = yn, giving Ein(g) = 0. 


(c) Does the nonparametric RBF always have Ein = 0? 


The parametric RBF has N parameters w1,..., wyn, ensuring we can always 
fit the data. When the data has stochastic or deterministic noise, this means 
we will overfit. The root of the problem is that we have too many parameters, 
one for each bump. Solution: restrict the number of bumps to k < N. If we 
restrict the number of bumps to k, then only k weights w1,..., wz need to be 
learned. Naturally, we will choose the weights that best fit the data. 

Where should the bumps be placed? Previously, with N bumps, this was 
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not an issue since we placed one bump on each data point. Though we cannot 
place a bump on each data point, we can still try to choose the bump centers so 
that they represent the data points as closely as possible. Denote the centers 
of the bumps by ,41,..., Hk. Then, a hypothesis has the parameterized form 


k 
h(x) = w + > wid (=) = w'®(x), (6.4) 


j=1 


where ®(x)" = [1, ®i(x),...,®x(x)] and ®;(x) = ¢(||x — a;||/r). Observe 
that we have added back the bias term wy.!! For the nonparametric RBF or 
the parametric RBF with N bumps, we did not need the bias term. However, 
when you only have a small number of bumps and no bias term, the learning 
gets distorted if the y-values have non-zero mean (as is also the case if you do 
linear regression without the constant term). 

The unknowns that one has to learn by fitting the data are the weights 
wo, W1,..., Wp and the bump centers H1, ..., Hp (which are vectors in R2). 
The model we have specified is called the radial basis function network (RBF- 
network). We can get the RBF-network model for classification by taking the 
sign of the output signal, and for logistic regression we pass the output signal 
through the sigmoid @(s) = e*/(1 + e°). 

A useful graphical representation of 


the model as a ‘feed forward’ network is h(x) 
illustrated. There are several important 
features that are worth discussing. First, ©) 


the hypothesis set is very similar to a lin- 
ear model except that the transformed fea- 0 @, Wk 
tures ®;(x) can depend on the data set 


Wj 
(through the choice of the p; which are D a a 


chosen to fit the data). In the linear 


model of Chapter 3, the transformed fea- \Ix— p11 (ee pl \|x— pra 
tures were fixed ahead of time. Because r ip 
the uj appear inside a nonlinear function, (x) 


this is our first model that is not linear in 

its parameters. It is linear in the wj, but 

nonlinear in the uj. It turns out that allowing the the basis functions to de- 
pend on the data adds significant power to this model over the linear model. 
We discuss this issue in more detail later, in Chapter 7. 

Other than the parameters wj and ys; which are chosen to fit the data, 
there are two high-level parameters k and r which specify the nature of the 
hypothesis set. These parameters control two aspects of the hypothesis set 
that were discussed in Chapter 5, namely the size of a hypothesis set (quan- 
tified by k, the number of bumps) and the complexity of a single hypothesis 
(quantified by r, which determines how ‘wiggly’ an individual hypothesis is). 


11 An alternative notation for the scale parameter 1/r is y, and for the bias term wo is b 
(often used in the context of SVM, see Chapter 8). 
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It is important to choose a good value of k to avoid overfitting (too high a k) 
or underfitting (too low a k). A good strategy for choosing k is using cross 
validation. For a given k, a good simple heuristic for setting r is 


r= TIa’ 

where R = max;,; ||x; — x; || is the ‘diameter’ of the data. The rationale for this 
choice of r is that k spheres of radius r have a volume equal to cgkr? = ca R? 
which is about the volume of the data (cq is a constant depending only on d). 
So, the k bumps can ‘cover’ the data with this choice of r. 


6.3.2 Fitting the Data 


To complete our specification of the RBF-network model, we need to discuss 
how to fit the data. For a given k and r, to obtain our final hypothesis g, 
we need to determine the centers uj and the weights wj. In addition to the 
weights, we appear to have an abundance of parameters in the centers pj. 
Fortunately, we will determine the centers by choosing them to represent the 
inputs X1,..., Xy without reference to the y;,...,yn. The heavy lifting to fit 
the yn will be done by the weights, so if there is overfitting, it is mostly due 
to the flexibility to choose the weights, not the centers. 

For a given set of centers, uj, the hypothesis is linear in the w;; we can 
therefore use our favorite algorithm from Chapter 3 to fit the weights if the 
centers are fixed. The algorithm is summarized as follows. 


Fitting the RBF-network to the data (given k,r): 


1: Using the inputs X, determine k centers 41,..., Hk- 
2: Compute the N x (k + 1) feature matrix Z 


J = 2; (Xn) = [> = bil) 


Each row of Z is the RBF-feature corresponding to x, 
(where we have the dummy bias coordinate Zn, 9 = 1). 
3: Fit the linear model Zw to y to determine the weights w*. 





We briefly elaborate on step 3 in the above algorithm. 


e For classification, you can use one of the algorithms for linear models 
from Chapter 3 to find a set of weights w that separate the positive 
points in Z from the negative points; the linear programming algorithm 
in Problem 3.6 would also work well. 

e For regression, you can use the analytic pseudo-inverse algorithm 


w* = Zy = (Z"Z)"'Z'y, 
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where the last expression is valid when Z has full column rank. Recall 
also the discussion about overfitting from Chapter 4. It is prudent to 
use regularization when there is stochastic or deterministic noise. With 
regularization parameter À, the solution becomes w* = (Z™Z+AI)~!Z’y; 
the regularization parameter can be selected with cross validation. 

e For logistic regression, you can obtain w* by minimizing the logistic re- 
gression cross entropy error, using, for example, gradient descent. More 
details can be found in Chapter 3. 


In all cases, the meat of the RBF-network is in determining the centers pj 
and computing the radial basis feature matrix Z. From that point on, it is a 
straightforward linear model. 


























> oO 
[e] 
ce) 
o—e e d d H—e z € 
T T 
(a)k=4 (b) k= 10 (c) k = 10, regularized 


Figure 6.11: The parametric RBF-network using the same noisy data from 
Figure 6.7. (a) Small k. (b) Large k results in overfitting. (c) Large k with 
regularized linear fitting of weights (A = 0.05). The centers uj are the gray 
shaded circles and r = 1/k. The target function is the light gray curve. 


Figure 6.11 illustrates the RBF-network on the noisy data from Figure 6.7. 
We (arbitrarily) selected the centers u; by partitioning the data points into 
k equally sized groups and taking the average data point in each group as a 
center. When k = 4, one recovers a decent fit to f even though our centers 
were chosen somewhat arbitrarily. When k = 10, there is clearly overfitting, 
and regularization is called for. 

Given the centers, fitting a RBF-network reduces to a linear problem, for 
which we have good algorithms. We now address this first step, namely how 
to determine these centers. Determining the optimal bump centers is the hard 
part, because this is where the model becomes essentially nonlinear. Luckily, 
the physical interpretation of this hypothesis set as a set of bumps centered 
on the yt; helps us. When there were N bumps (the nonparametric case), 
the natural choice was to place one bump on top of each data point. Now 
that there are only k < N bumps, we should still choose the bump centers 
to closely represent the inputs in the data (we only need to match the xn, so 
we do not need to know the yn). This is an unsupervised learning problem; in 
fact, a very important one. 
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One way to formulate the task is to require that no x, be far away from a 
bump center. The x, should cluster around the centers, with each center pj 
representing one cluster. This is known as the k-center problem, known to be 
NP-hard. Nevertheless, we need a computationally efficient way to partition 
the data into k clusters and pick representative centers { by} Fy, one center for 
each cluster. One thing we learn from Figure 6.11 is that even for somewhat 
arbitrarily selected centers, the final result is still quite good, and so we don’t 
have to find the absolutely optimal way to represent the data using k centers; 
a decent approximation will do. The k-means clustering algorithm is one way 
to pick a set of centers that are a decent representation of the data. 


6.3.3 Unsupervised k-Means Clustering 


Clustering is one of the classic unsuper- 
vised learning tasks. The k-means clus- 
tering algorithm is a simple yet powerful 
tool for obtaining clusters and cluster cen- 
ters. We illustrate a clustering of 500 of 
the digits data into ten clusters. It is in- 
structive to compare the clusters obtained 
here in an unsupervised manner with the 
classification regions of the k-NN rule in 
Figure 6.6(b). The clusters seem to mimic 
to some extent the classification regions, 
indicating that there is significant infor- 
mation in the input data alone. 

We already saw the basic algorithm when we discussed partitioning for 
the branch and bound approach to finding the nearest neighbor. The goal 
of k-means clustering is to partition the input data points x,,...,xy into 
k sets S1,...,S, and select centers p1,..., Hp for each cluster. The centers 
are representative of the data if every data point in cluster S; is close to its 
corresponding center uj. For cluster S; with center uj, define the squared 
error measure Ej to quantify the quality of the cluster, 


2 
Ej = $ [xn l’. 


XnESj 

















The error Æj measures how well the center u; approximates the points in S;. 
The k-means error function just sums this cluster error over all clusters, 


k N 
Ein(S1 2 Ski An- e) = >) E; = Do Xn — u(x), (6.5) 


j=1 n=1 


where u(xn) is the center of the cluster to which x, belongs. We seek the 
partition S1,..., Sk and centers H1,..., Hk that minimize the k-means error. 
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Exercise 6.13 


(a) Fix the clusters to S1,...,S%. Show that the centers which minimize 
Ein(S1,-.-, Sk; 1,---, Hk) are the centroids of the clusters: 
1 
mij = S] a Xn- 
J Xn ES; 
(b) Fix the centers to p1,..., Hp. Show that the clusters which minimize 


Ein(S1,..-, Sk; H1,- - - , Hk) are obtained by placing into S; all points 
for which the closest center is mj, breaking ties arbitrarily: 


Sj = {xn : Xn — byl] < [ln — pell for £= 1... k}. 


Minimizing the k-means error is an NP-hard problem. However, the previous 
exercise says that if we fix a partition, then the optimal centers are easy to 
obtain. Similarly, if we fix the centers, then the optimal partition is easy 
to obtain. This suggests a very simple iterative algorithm which is known as 
Lloyd’s algorithm. The algorithm starts with candidate centers and iteratively 
improves them until convergence. 


Lloyd’s Algorithm for k-Means Clustering: 


: Initialize u; (e.g. using the greedy approach on page 16). 


: Construct S; to be all points closest to p;. 
: Update each uj to equal the centroid of S}. 
: Repeat steps 2 and 3 until Ei, stops decreasing. 





Lloyd’s algorithm was the algorithm used to cluster the digits data into the 
ten clusters that were shown in the previous figure. 


Exercise 6.14 


Show that steps 2 and 3 in Lloyd's algorithm can never increase Ein, and 
hence that the algorithm must eventually stop iterating. [Hint: There are 
only a finite number of different partitions] 


Lloyd’s algorithm produces a partition which is ‘locally optimal’ in an unusual 
sense: given the centers, there is no better way to assign the points to clusters 
defined by the centers; and, given the partition, there is no better choice for 
the centers. However, the algorithm need not find the optimal partition as 
one might be able to improve the k-means error by simultaneously changing 
the centers and the cluster memberships. Lloyd’s algorithm falls into a class 
of algorithms known as E-M (expectation-maximization) algorithms. It min- 
imizes a complex error function by separating the variables to be optimized 
into two sets. If one set is known, then it is easy to optimize the other set, 
which is the basis for an iterative algorithm, such as with Lloyd’s algorithm. It 
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(a) 10- center RBF-network (b) 300-center RBF-network 


Figure 6.12: The RBF-network model for classifying the digits data (‘1’ 
versus ‘not 1’) with (a) 10 centers and (b) 300 centers. The centers are 
first obtained by the k-means clustering algorithm. The resulting classifier 
is obtained after fitting the weights using the linear regression algorithm 
for classification (see Chapter 3). The figures highlight the possibility of 
overfitting with too many centers. 


is an on-going area of theoretical research to find algorithms which can guar- 
antee near optimal clustering. There are some algorithms which achieve this 
kind of accuracy and run in O(n20“) time (curse of dimensionality). For our 
practical purposes, we do not need the best clustering per se; we just need a 
good clustering that is representative of the data, because we still have some 
flexibility to fit the data by choosing the weights wj. Lloyd’s algorithm works 
just fine in practice, and it is quite efficient. In Figure 6.12, we show the RBF- 
network classifier for predicting ‘1’ versus ‘not 1’ when using 10 centers versus 
300 centers. In both cases, the cluster centers were determined using Lloyd’s 
algorithm. Just as with linear models, one can overfit with RBF-networks. To 
get the best classifier, one should use regularization and validation to combat 
overfitting. 


Selecting the Number of Clusters k. We introduced clustering as a 
means to determine the centers of the RBF-network in supervised learning. 
The ultimate goal is out-of-sample prediction, so we need to choose k (the 
number of centers) to give enough flexibility without overfitting. In our super- 
vised setting, we can minimize a cross validation error to choose k. However, 
more generally, clustering is a stand alone unsupervised task. With our digits 
data, a natural choice for k is ten, one cluster for each digit. Algorithms to 
choose k using only the unlabeled data without knowing the number of classes 
in the supervised task are also part of on-going research. 
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6.3.4 Justifications and Extensions of RBFs 


We ‘derived’ RBFs as a natural extension to the k-nearest neighbor algorithm; 
a soft version of k-nearest neighbors, if you will. There are other ways to 
derive them. RBF-networks arise naturally through Tikhonov regularization 
for fitting nonlinear functions to the data — Tikhonov regularization penalizes 
curvature in the final hypothesis (see Problem 6.20). RBF-networks also arise 
naturally from noisy interpolation theory where one tries to achieve minimum 
expected in-sample error under the assumption that the x-values are noisy; 
this is similar to regularization where one asks g to be nearly a constant equal 
to Yn in the neighborhood of xn. 

Our RBF-network had k identical bumps. A natural way to extend this 
model is to allow the bumps to have different shapes. This can be accomplished 
by choosing different scale parameters for each bump (so r; instead of r). 
Further, the bumps need not be spherically symmetric. For the Gaussian 
kernel, the hypothesis for this more general RBF-network has the form: 


k 1 T51 
h(x) =$ wy a yA 27 (x pi), (6.6) 


The weights wi the centers {uj FP (vectors in R4) and the shapes 
{Xj a (d x d positive definite symmetric covariance matrices) all need to 
be learned from the data. Intuitively, each bump 7 still represents a cluster 
of data centered at uj, but now the scale r is replaced by a ‘scale matrix’ 
X; which captures the scale and correlations of points in the cluster. For 
the Gaussian kernel, the uj are the cluster centroids, and the X; are the 
cluster covariance matrices. We can fit the w; using techniques for linear 
models, once the u; and X; are given. As with the simpler RBF-network, 
the location and shape parameters can be learned in an unsupervised way; 
an E-M algorithm similar to Lloyd’s algorithm can be applied (Section 6.4.1 
elaborates on the details). For Gaussian kernels, this unsupervised learning 
task is called learning a Gaussian Mixture Model (GMM). 


6.4 Probability Density Estimation 


Clustering was our first unsupervised method that tried to organize the input 
data. A cluster contains points that are similar to each other. The probability 
density of x is a generalization of clustering to a finer representation. Clusters 
can be thought of as regions of high probability. The basic task in probability 
density estimation is to estimate, for a given x, how likely it is that you would 
generate inputs similar to x. To answer this question we need to look at what 
fraction of the inputs in the data are similar to x. Since similarity plays an 
important role, many of the similarity-based techniques in supervised learning 
have counterparts in the unsupervised task of probability density estimation. 
Vast tomes have been written on this topic, so we only skim its surface. 
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The Histogram. Given data x1,..., XN 
generated independently from P(x), the 
task is to learn the entire probability den- 
sity P. Since P(x) is the density of points 
at x, the natural estimate is to use the in- 
sample density of points around x. The 
histogram density estimate illustrated on 
the right uses a partition of X into uniform 
disjoint hypercubes (multidimensional in- 
tervals). Every point x falls into one bin 
and each bin has volume V. In one dimen- 
sion, V is just the length of the interval. The estimate P(x) is the density of 
points in the hypercube that contains x, normalized so that the integral is 1. 


‘ 1 N(x) 
x)= —. 
N V” 


where N(x) is the number of data points in the hypercube containing x. The 
density of points in this hypercube containing x is N(x)/V. The normalization 
by 1/N is to ensure that f dx P(x) =1. The histogram estimate P(x) is not 
continuous. The main parameter which controls the quality of the estimate 
is V (or the number of bins, which is inversely related to V). Under mild 
assumptions, one recovers the true density as V > 0 and N-V —> œo. This 
ensures that you are only using points very similar to x to estimate the density 
at x and there are enough such points being used. One choice is V = 1//N. 








Nearest Neighbor Density Estimation. Rather than using uniform hy- 
percubes, one can use the distance to the kth nearest neighbor to determine the 
volume of the region containing x. Let rj,j(x) = ||x — xj,)|| be the distance 
from x to its kth nearest neighbor, and Vj,)(x) the volume of the spheroid 
centered at x of radius rj)(x). In d-dimensions, Vj.) and rj) are related by 


The density of points in this spheroid 
is k/Vix], so we have the estimate 





ee = Vag x)’ 


where c is chosen by normalizing P, so 
that [dx P(x) = 1. The 3-NN density 
estimate using 6 points from a uniform dis- 
tribution is illustrated to the right. As is 
evident, the nearest neighbor density esti- 
mate is continuous but not smooth, having sharp spikes and troughs. 
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x 


(a) 31-NN density estimate (b) Density contours 


Figure 6.13: The 31-NN density estimate for 500 digits data. (a) The 
probability density surface showing the spike behavior of the estimate. (b) 
Density contours (red is high, blue is low) from which we can visually identify 
the clusters in the data as regions of locally high density. 


The nearest neighbor density estimate only works with a bounded input 
space because otherwise the estimate is not normalizable. In fact the tail of the 
nearest neighbor density estimate is decaying as ||x||~“ which is fat-tailed. The 
nearest neighbor density estimate is nonparametric, and the usual convergence 
theorem holds. If k — oo and k/N — 0, then under mild assumptions on P, 
P converges to P (where the discrepancy between P and P is measured by the 
integrated squared error, [dx (P(x) — P(x))?). Figure 6.13 illustrates the 
nearest neighbor density estimate using 500 randomly sampled digits data. 


Parzen Windows: RBF Density Estimation. Perhaps the most com- 
mon density estimation technique is the Parzen window. Recall that the kernel 
¢ is a bump function which is positive. For density estimation, it is convenient 
to normalize ¢ so that f dx ¢(||x||) = 1. The normalized Gaussian kernel is 


1 
(2) = Gna 


1,2 
-32 


One can verify that P(x) = ¢(||x||) is a probability density, and for any r > 0, 


P= io (2L) 


is also a density (r is the width of the bump). The Parzen window is similar to 
the nonparametric RBF in supervised learning: you have a bump with weight 


+ on each data point, and Ê(x) is a sum of the bumps: 
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x 


(a) RBF density estimate (b) Density contours 


Figure 6.14: Gaussian kernel Parzen window with r = 0.02 using 500 digits 
data. (a) The probability density surface is smooth but bumpy around the 
periphery of the data; the tail is exponential, dropping very quickly to zero 
away from the data. Compare this with the spiky more slowly decaying 
density produced by nearest neighbor in Figure 6.13. (b) Density contours 
(red is high, blue is low) highlighting the clusters in the data as regions of 
locally high density. Clusters are more readily visible with the smoother 
Parzen window than the nearest neighbor estimate. 


Px) = str oo (2), 


Since each bump integrates to 1, the scal- 
ing by $ ensures that P(x) integrates 
to 1. The figure to the right illustrates 
the Gaussian kernel Parzen window with 
six points from a uniform density. As op- 
posed to the nearest neighbor density, the Parzen window is smooth, and has 
a thin tail. These properties are inherited from the Gaussian kernel, which 
is smooth and exponentially decaying. Figure 6.14 shows the Gaussian ker- 
nel Parzen window applied to our 500 randomly sampled digits data, with 
r = 0.02. Observe that the Parzen window can be bumpy around the periph- 
ery of the data. We can reduce this bumpiness by increasing r, which risks 
missing the finer structure at the heart of the data. 

As with nearest neighbors, the Parzen window is nonparametric, modulo 
the choice of the kernel-width. If r — 0 and r4N — œ as N grows, then under 
mild assumptions, Ê converges to P with respect to integrated square error. 
One choice for r is 1/ */N. One can also use cross validation to determine a 
good value of r that maximizes the likelihood of the validation set (recall that 
we used likelihood to derive the error function for logistic regression). 





P(x) 








© | Abu-Mostafa, Magdon-Ismail, Lin: Jan-2015 e-Chap:6—37 











e-6. SIMILARITY-BASED METHODS 6.4. PROBABILITY DENSITY ESTIMATION 


6.4.1 Gaussian Mixture Models (GMMs) 


Just as the RBF-network is the parametric version of the nonparametric RBF, 
the Gaussian mixture model (GMM) is the parametric version of the Parzen 
window density estimate. The Parzen window estimate places a Gaussian 
bump at every data point; the GMM places just k bumps at centers p1, ... , Hk- 
The Gaussian kernel is the most commonly used and easiest to handle. In d- 
dimensions, the Gaussian density with center yz and covariance matrix © is: 


1 —21(x — p)tTO 1 (x — 
Nea d) = Gaps 2 (x n) (x n), 














Note that the center p is also the expected value, E[x] = u; and the covariance 
between two features z;, zj is Xij, E[(x — w)(x — p)"] = È. 

To derive the GMM, we use the following simple model of how the data 
is generated. There are k Gaussian distributions, with respective means 
H1,- .., Hk and covariance matrices 44,..., 4%. To generate a data point x, 
first pick a Gaussian j € {1,...,k} according to probabilities {w1,..., Wk}, 
where wj > 0 and De wj = 1. After selecting Gaussian j, x is generated 
according to a Gaussian distribution with parameters uj, uj. The probability 
density of x given that you picked Gaussian j is 














: 1 —2(x — p;)TXT (x — u; 
Pli) = N G6 uy, d3) = Ga ee a(x Hj) J (x Mj). 
J 


By the law of total probability, the unconditional probability density for x is 


k 
= LPs P| = Sa (x; Hj, U5) (6.7) 


Compare (6.7) with (6.4). The GMM is just an RBF-network with weights wj 

and a Gaussian kernel — a sum of Gaussian bumps, hence the name. 
Each of the & Gaussian bumps represents 

a ‘cluster’ of the data. The probability den- 

sity in Equation (6.7) puts a bump (Gaussian) 

of total probability (weight) wj at the center 

Hj; Xj determines the ‘shape’ of the bump. 

An example of some data generated from a e ° 

GMM in two dimensions with k = 3 is shown 

on the right. The Gaussians are illustrated by 

a contour of constant probability. The differ- K 

ent shapes of the clusters are controlled by the 

covariances £j. Equation (6.7) is based on a 

generative model for the data, and the param- 

eters of the model need to be learned from the actual data. These parameters 

are the mixture weights wj, the centers uj and the covariance matrices Xj. 
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To determine the unknown parameters in the GMM, we will minimize an 
in-sample error called the likelihood. We already saw the likelihood when we 
derived the in-sample error for logistic regression in Chapter 3. Among all 
the possible GMMs, each determined by a different choice for the parameters 
in Equation (6.7), we want to select one. Our criterion for choosing is that, 
for the chosen density estimate P, the data should have a high probability 
of being generated. Since the data points are independent, the probability 


(density) for the data x,,...,xy if the data were generated according to P(x) 
is 
N N fk 
P(xi,..., XN) = [|] ?&) = II XO wN (xn; uj, Dy) 
n=1 n=1 \j=1 


This is the likelihood of a particular GMM specified by parameters {w;, 4,5; }. 
The method of maximum likelihood selects the parameters which maximize 
P(x, ...,Xy), or equivalently which minimize — ln P(x,,...,xy). Thus, we 
may minimize the in-sample error: 


Ein(wy, Hj, £3) E —In P(x1,...,xw) 


N k 
= -X in |X wN ns my, Ey) | - (6.8) 
i=l j=l 


To determine {w;, u;,4,;}, we need an algorithm to minimize this in-sample 
error. Equation (6.8) is general in that we can replace the Gaussian den- 
sity N (Xn; Hj, 4,7) with any other parameterized density and we will obtain a 
non-Gaussian mixture model. Unfortunately, even for the friendly Gaussian 
mixture model, the summation inside the logarithm makes it very difficult 
to minimize the in-sample error, and we are going to need a heuristic. This 
heuristic is the Expectation Maximization (E-M) algorithm. The E-M algo- 
rithm is efficient and produces good results. To illustrate, we run a 10-center 
GMM on our sample of 500 digits data. The results are shown in Figure 6.15. 


The Expectation Maximization (E-M) Algorithm The E-M algorithm 
was introduced in the late 1970s and is an important heuristic for maximizing 
the likelihood function. It is based on the notion of a latent (hidden, unmea- 
sured, missing) piece of data that would make the optimization much easier. 
In the context of the Gaussian Mixture Model, suppose we knew which data 
points came from bump 1, bump 2,..., bump k. The problem would suddenly 
become much easier, because we can estimate the center and covariance matrix 
of each bump using the data from that bump alone; further we can estimate 
the probabilities w; by the fraction of data points in bump j. Unfortunately 
we do not know which data came from which bump, so we start with a guess, 
and iteratively improve this guess. The general algorithm is 
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x 


(a) 10-center GMM (b) Density contours 


Figure 6.15: The 10-center GMM for 500 digits data. (a) The probability 
density surface. The bumps are no longer spherical as in the Parzen window. 
The bump shapes are determined by the covariance matrices. (b) Density 
contours (red is high, blue is low) with the centers as black dots. The centers 
identify the ‘clusters’ (compare with the k-means clusters on page 31). 


E-M Algorithm for GMMs: 
1: Start with estimates for the bump membership of each xn. 


2: Estimates of wj, Hj, X; given the bump memberships. 
3: Update the bump memberships given wj, Hj, Uj; iterate to 
step 2 until convergence. 





To mathematically specify the algorithm, we will add one more ingredient, 
namely that the bump memberships need not be all or nothing. Specifically, 
at iteration t, let y,;(t) > 0 be the ‘fraction’ of data point x,, that belongs 
to bump j, with 9 Ynj = 1 (the entire point is allocated among all the 
bumps); you can view Yn; as the probability that xn was generated by bump j. 
The ‘number’ of data points belonging to bump 7 is given by 


N 
N; = XO ag: 
n=1 


The Yn; are the hidden variables that we do not know, but if we did know the 
Ynj, then we could compute estimates of wj, Hj, Uj: 


Nj 
Wy = N?’ 
i 
Hj = Wy 247 YnjXn; 
T 
ty = a Nee, = bpa (6.9) 
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Intuitively, the weights are the fraction of data belonging to bump j; the 
means are the average data point belonging to bump 7 where we take into 
account that only a fraction of a data point may belong to a bump; and, 
the covariance matrix is the weighted covariance of the data belonging to the 
bump. Once we have these new estimates of the parameters, we can update 
the bump memberships 7,;. To get yn;(t+ 1), we compute the probability 
that data point x, came from bump j given the parameters wj, pj, Xj. We 
want 


Yng(t + 1) = Ply xn]. 
By an application of Bayes rule, 
P(xnli) P [j] _ Nn My, Bj) wy 





We won’t have to compute P(x,,) in the denominator because it is independent 
of j and can be fixed by the normalization condition DA Ynj = 1. We thus 
arrive at the update for the bump memberships, 


wN (Xn; Hj, £3) 
pi weN (Xn; He, De) 


After updating the bump memberships, we iterate back to the parameters and 
continue until convergence. To complete the specification of the algorithm, we 
need to specify the initial bump memberships. A simple heuristic is to use 
the k-means clustering algorithm described earlier to obtain a ‘hard’ partition 
with ynj € {0,1}, and proceed from there. 


Ynj(t +1) = 


Exercise 6.15 


What would happen in the E-M algorithm described above if you initialized 
the bump memberships uniformly to ynj = 1/k? 


The E-M algorithm is a remarkable example of a learning algorithm that 
‘bootstraps’ itself to a solution. The algorithm starts by guessing some values 
for unknown quantities that you would like to know. The guess is probably 
quite wrong, but nevertheless the algorithm makes inferences based on the 
incorrect guess. These inferences are used to slightly improve the guess. The 
guess and inferences slightly improve each other in this way until at the end 
you have bootstrapped yourself to a decent guess at the unknowns, as well as 
a good inference based on those unknowns. 
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6.5 Problems 


Problem 6.1 Consider the following data set with 7 data points. 


(Lo) 2) (i) (A) =) (Le =) 
(B14) (2) 4) (Lo) 


(a) Show the decision regions for the 1-NN and 3-NN rules. 
(b) Consider the non-linear transform 

v1), [a] _ Jur +23 

T2 22 arctan(a2/x1)}’ 


which maps x to z. Show the classification regions in the x-space for the 
1-NN and 3-NN rules implemented on the data in the z-space. 


Problem 6.2 Use the same data from the previous problem. 


(a) Let the mean of all the -1 points be 21 and the mean of all the +1 points 
be 441. Suppose the data set were condensed into the two prototypes 
{(u-1,—1), (+1, +1)} (these points need not be data points, so they 
are called prototypes). Plot the classification regions for the 1-NN rule 
using the condensed data. What is the in-sample error? 


(b) Consider the following approach to condensing the data. At each step, 
merge the two closest points of the same class as follows: 


(x,c) + (x’,c) > G(x + x’), c). 


Again, this method of condensing produces prototypes. Continue con- 
densing until you have two points remaining (of different classes). 


Plot the 1-NN rule with the condensed data. What is the in-sample error? 





Problem 6.3 Show that the k-nearest neighbor rule with distance de- 
fined by d(x,x’) = (x — x’)"Q(x — x’), where Q is positive semi-definite, is 
equivalent to the k-nearest neighbor rule with the standard Euclidean distance 
in some transformed feature space. Explicitly construct this space. What is the 
dimension of this space. [Hint: Think about the rank of Q.] 


Problem 6.4 For the double semi-circle problem in Problem 3.1, plot the 
decision regions for the 1-NN and 3-NN rules. 
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Problem 6.5 Show that each of the Voronoi regions in the Voronoi 
diagram for any data set is convex. (A set C is convex if for any x, x’ € C and 
any à € [0,1], Ax + (1 — A)x’ € C.) 


Problem 6.6 For linear regression with weight decay, g(x) = X’Wreg. 
Show that 


N 
g(x) = 5 x"(Z7Z + AIT) xnyn- 


n=1 


A kernel representation of a hypothesis g is a representation of the form 


where K(x, x’) is the kernel function. What is the kernel function in this case? 


One can interpret the kernel representation of the final hypothesis from linear 
regression as a similarity method, where g is a weighted sum of the target values 
{yn}, weighted by the “similarity” K(x, Xn) between the point x and the data 


point xn. Does this look similar ©) to RBFs? 


Problem 6.7 Consider the hypothesis set H which contains all labeled 
Voronoi tessellations on K points. Show that dyc(H) = K. 


Problem 6.8 Suppose the target function is deterministic, so (x) = 0 
or n(x) = 1. The decision boundary implemented by f is defined as follows. 
The point x is on the decision boundary if every ball of positive radius centered 
on x contains both positive and negative points. Conversely if x is not on 
the decision boundary, then some ball around x contains only points of one 
classification. 


Suppose the decision boundary (a set of points) has probability zero. Show 
that the simple nearest neighbor rule will asymptotically (in N) converge to 
optimal error E3,, (with probability 1). 


Problem 6.9 Assume that the support of P is the unit cube in d 
dimensions. Show that for any €e, > 0, there is a sufficiently large N for 
which, with probability at least 1 — 6, 


sup ||x — xa (x)| < €. 
xEX 


[Hint: Partition [0,1]¢ into disjoint hypercubes of size e/v/d. For any x € 
[0, 1], show that at least one of these hypercubes is completely contained in 
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B.(x). Let p > 0 be the minimum P-measure of a hypercube. Since the 
number of hypercubes is finite, use a union bound to show that with high 
probability, every hypercube will contain at least k points for any fixed k as 
N > œ. Conclude that ||x — x(,)|| < € with high probability as N — oo.] 


Problem 6.10 Let Fout(k) = limw— oo Fout(gn(k)), where gn (k) is the 
k-nearest neighbor rule on a data set of size N, where k is odd. Let Ey; is 
the optimal out-of-sample probability of error. Show that (with probability 1), 


Ež s < Eout(k) < Eout(k — 2) < +++ < Eou(1) < 255 


Problem 6.11 For the 1-NN rule in two dimensions (d = 2) and data set 
(x1, Y1),---, (XN, yn), consider the Voronoi diagram and let V, be the Voronoi 
region containing the point xn. Two Voronoi regions are adjacent if they have 
a face in common (in this case an edge). Mark a point xn if the classification 
of every Voronoi region neighboring V» is the same as yn. Now condense the 
data by removing all marked points. 


(a) Show that the condensed data is consistent with the full data for the 
1-NN rule. 


(b) How does the out-of-sample error for the 1-NN rule using condensed data 
compare with the 1-NN rule on the full data (worst case and on average)? 


Problem 6.12 For the 1-NN rule, define the influence set Sn of point 
Xn as the set of points of the same class as x, which are closer to xn than to 
a point of a different class. So Xm € Sn if 


Xn — Xml < ||km: — Xml| for all m’ with ym’ on 
y. y 


We do not allow x,, to be in Sn. Suppose the largest influence set is Sn», the 
influence set of xn» (break ties according to the data point index). Remove all 
points in Sn». Update the remaining influence sets by deleting x,» and Sn» 
from them. Repeat this process until all the influence sets are empty. The set 
of points remaining is a condensed subset of D. 


(a) Show that the remaining condensed set is training set consistent. 


(b) Do you think this will give a smaller or larger condensed set than the 
CNN algorithm discussed in the text. Explain your intuition? 


(c) Give an efficient algorithm to implement this greedy condensing. What 
is your run time. 
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Problem 6.13 Construct a data set with 1,000 points as shown. 


The data are generated from a 4-center GMM. 
The centers are equally spaced on the unit cir- 
cle, and covariance matrices all equal to ol with 
o = 0.15. The weight of each Gaussian bump 
is i Points generated from the first and third 
centers have yn = +1 and the other two centers 
are —1. To generate a point, first determine a 
bump (each has probability 5) and then generate 
a point from the Gaussian for that bump. 





(a) Implement the CNN algorithm in the text and the condensing algorithm 
from the previous problem. 


(b) Run your algorithm on your generated data and give a plot of the con- 
densed data in each case. 


(c) Repeat this experiment 1,000 times and compute the average sizes of the 
condensed sets for the two algorithms. 


Problem 6.14 Run the condensed nearest neighbor (CNN) algorithm for 
3-NN on the digits data. Use all the data for classifying “1” versus “not 1”. 


(a) Set N = 500 and randomly split your data into a training set of size N 
and use the remaining data as a test/validation set. 


(b) Use the 3-NN algorithm with all the training data and evaluate its per- 
formance: report Ein and Etest. 


(c) Use the CNN algorithm to condense the data. Evaluate the performance 
of the 3-NN rule with the condensed data: report Ein and Eout. 


(d) Repeat parts (b) and (c) using 1,000 random training-test splits and 
report the average Ein and Eout for the full versus the condensed data. 


Problem 6.15 This problem asks you to perform an analysis of the branch 
and bound algorithm in an idealized setting. Assume the data is partitioned 
and each cluster with two or more points is partitioned into two sub-clusters of 
exactly equal size. (The number of data points is a power of 2). 


When you run the branch and bound algorithm for a particular test point x, 
sometimes the bound condition will hold, and sometimes not. If you generated 
the test point x randomly, the bound condition will hold with some probability. 
Assume the bound condition will hold with probability at least p > 0 at every 
branch, independently of what happened at other branches. 


Let T(N) be the expected time to find the nearest neighbor in a cluster with N 
points. Show: T(N) = O(dlog N + dN'°22°°-?)) (sublinear for p > 0). 


[Hint: Let N = 2"; show that T(N) < 2d+7(%) + (1—p)T(4).] 
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Problem 6.16 


(a) Generate a data set of 10,000 data points uniformly in the unit square 
[0, 1]? to test the performance of the branch and bound method: 


(i) Construct a 10-partition for the data using the simple greedy heuristic 
described in the text. 

(ii) Generate 10,000 random query points and compare the running time 
of obtaining the nearest neighbor using the partition with branch 
and bound versus the brute force approach which does not use the 
partition. 


(b) Repeat (a) but instead generate the data from a mixture of 10 gaussians 
with centers randomly distributed in [0, 1]? and identical covariances for 
each bump equal to oI where o = 0.1. 


(c) Explain your observations. 


(d) Does your decision to use the branch and bound technique depend on 
how many test points you will need to evaluate? 


Problem 6.17 Using Exercise 6.8, assume that p = 4 (the probability that 
the bound condition will hold at any branch point). Estimate the asymptotic 
running time to estimate the out-of-sample error with 10-fold cross validation 
to select the value of k, and compare with Exercise 6.6. 


Problem 6.18 An alternative to the k-nearest neighbor rule is the r- 
nearest neighbor rule: classify a test point x using the majority class among all 
neighbors xn within distance r of x. The r-nearest neighbor explicitly enforces 
that all neighbors contributing to the decision must be close; however, it does 
not mean that there will be many such neighbors. Assume that the support of 
P(x) is a compact set. 


(a) Show that the expected number of neighbors contributing to the decision 
for any particular x is order of Nrt. 


(b) Argue that as N grows, if Nr? — oo and r —> 0, then the classifier 
approaches optimal. 


(c) Give one example of such a choice for r (as a function of N, d). 


Problem 6.19 For the full RBFN in Equation (6.6) give the details of 
a 2-stage algorithm to fit the model to the data. The first stage determines 
the parameters of the bumps (their centers and covariance matrices); and, the 
second stage determines the weights. [Hint: For the first stage, think about 
the E-M algorithm to learn a Gaussian mixture model.] 
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Problem 6.20 [RBFs from regularization] This problem requires 
advanced calculus. Let d = 1; we wish to minimize a regularized error 


N 


Bash) = Dle) =u) +S a fae (POS, 
k=0 8. 


i=l 


where J is the regularization parameter. Assume h*) (x) (the kth derivative of 
h) decays to to 0 as |x| — oo. Let (x) be the Dirac delta function. 


(a) How would you select ax to penalize how curvy or wiggly h is? 
( 


b) [Calculus of Variations] Show that 


N 


Eaug(h) = fe dx [Sow -yiJ la i) 9 ak (ew . 
k=0 


=e, i=1 


Now find a stationary point of this functional: perturb h by 6h and 
assume that for all k, 5h“) — 0 as || = 0. Compute the change 
Eaug(h) = Eaug(h + 6h) — Eaug(h) and set it to 0. After integrating 
by parts and discarding all terms of higher order than linear in 6h, and 
setting all boundary terms from the integration by parts to 0, derive the 
following condition for h to be stationary (since 6h is arbitrary), 


N 


X (h(x) — wi) i(a = zi ADH arh™ (x) = 0. 


i=l 


(c) [Green’s Functions] L = EL- on is a linear differential opera- 
tor. The Green’s function G(x, 2’) for L satisfies LG (x, x’) = ô(x — 2’). 
Show that we can satisfy the stationarity condition by choosing 


N 
x)= 5 wiG (x, £i), 
i=1 


with w = (G + àI) ty, where Gi; = G(xi, £j). (h resembles an RBF 
with the Green’s function as kernel. If L is translation and rotation in- 
variant then G(x, x’) = G(||x — x’||), and we have an RBF.) 

(d) [Computing the Green's Function] Solve LG (x, x’) = ô(x — 2’) to get G. 
Define the Fourier transform and its inverse, 





G(fa)= f" aeaa) Gla!) = f7 fA, a). 


Fourier transform both sides of LG (x, x’) = (x — x’), integrate by parts, 
assuming that the boundary terms vanish, and show that 


Ran oo e2tif (a! —2) 
ee AI 





© M Abu-Mostafa, Magdon-Ismail, Lin: Jan-2015 e-Chap:6—47 











e-6. SIMILARITY-BASED METHODS 6.5. PROBLEMS 


where Q(f) = S229 ax(27 f)?" is an even function whose power series 
expansion is determined by the ap. If a, = 1/(2*k!), what is Q(f). Show 
that in this case the Green's function is the Gaussian kernel, 


l oe- 


G(a, 2’) = e 





(For regularization that penalizes a particular combination of the deriva- 
tives of h, the optimal non-parametric regularized fit is a Gaussian kernel 


RBF.) [Hint: You may need: f% _ dt en ae tot feel, Re(a) > 0.] 


Problem 6.21 Develop a linear programming approach to classification 
with similarity oracle d(-,-) (as in Problem 3.6). Assume RBF-like hypotheses: 


h(x) = sign (u + X wid(x, 3 è 


i=l 


where w is the weight parameter to be determined by fitting the data. Pick 
the weights that fit the data and minimize the sum of weight sizes >, |wi| 
(lasso regularization where we don’t penalize wo). 


(a) Show that to find the weights, one solves the minimization problem: 


N N 
minimize X` [wi] s.t. yn (v + Yds) > 1. 
ae =, 


i=1 
Do you expect overfitting? 


(b) Suppose we allow some error in the separation, then 


N 
Yn (o + oe rdw) 21- Gn; 


i=l 


where Cn > 0 are slack variables that measure the degree to which the 
data point (xn, yn) has been misclassified. The total error is ae Cae 
If you minimize a combination of the total weight sizes and the error with 
emphasis C on error, then argue that the optimization problem becomes 


minimize > [wn| + cy Ons (6.10) 
n=1 X n=1 
s.t. Yn (v + ndn) >1-G, 
Cn > 0, T 
where the inequalities must hold for n = 1,..., N. 


The minimization trades off sparsity of the weight vector with the extent of 
misclassification. To encourage smaller in-sample error, one sets C to be large. 
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Problem 6.22 Show that the minimization in (6.10) is a linear program: 


N N 
minimize X an +C% Gris 
wo, 
n=1 n=1 


—An < Wn < An, 


N 
s.t. Yn ( + So wnd(xn, x) 2>1-n, 


i=l 


Cn = 0, 


where the inequalities must hold for n = 1,...,.N. Formulate this linear pro- 
gram in a standard form as in Problem 3.6. You need to specify what the 
parameters A,a,b are and what the optimization variable z is. 


[Hint: Use auxiliary variables a1, ..., ayn to rewrite |w,| using linear functions.] 


Problem 6.23 Consider a data distribution, P(x, y) which is a mixture of 
k Gaussian distributions with means { p; }#_1 and covariance matrices {Xj }4_1; 
each Gaussian has probability p; > 0 of being selected, eae = 1; each 
Gaussian generates a positive label with probability mj. To generate (x,y), 
first select a Gaussians using probabilities pi,...,px. If Gaussian £ is selected, 
generate x from this Gaussian distribution, with mean je and covariance ¥p, 
and y = +1 with probability me (y = —1 otherwise). 


For test point x, show that the classifier with minimum error probability is 
S e EO — m) Ez (x p) 
f(x) = sign X wje 2 Hi) 5 Bil), 


where wj = p;(27r; — 1). [Hint: Show that the optimal decision rule can be 
written f(x) = sign(P[+1|x] — P[—1|x]). Use Bayes’ theorem and simplify.] 


(This is the RBF-network for classification. Since uj, 3, pj, mj are unknown, 
they must be fit to the data. This problem shows the connection between 
the RBF-network for classification and a very simple probabilistic model of the 
data. The Bayesians often view the RBF-network through this lens. ) 


Problem 6.24 [Determining the Number of Clusters k] Use the 
same input data (xn) as in Problem 6.13. 


(a) Run Lloyd's k-means clustering algorithm, for k = 1,2,3,4,..., out- 
putting the value of the k-means objective Ein(k). 

(b) Generate benchmark random data of the same size, uniformly over the 
smallest axis-aligned square containing the the actual data (this data has 
no well defined clusters). Run Lloyd's algorithm on this random data for 
k = 1,2,3,4,.... Repeat for several such random data sets to obtain the 
average k-means error as a function of k, Ef2"4(k). 
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(c) Compute and plot the gap statistic (as a function of k) 
G(k) = log ER" (k) — log Ein (k). 
(d) Argue that the maximum of the gap statistic is a reasonable choice for 


the number of clusters to use. What value for k do you get? 


(e) Repeat with different choices for o in Problem 6.13, and plot the value 
of k chosen by the gap statistic versus ø. Explain your result. 


Problem 6.25 [Novelty Detection] Novelty corresponds to the arrival 
of a new cluster. Use the same input data (xn) as in Problem 6.13. 
(a) Use the method of Problem 6.24 to determine the number of clusters. 


(b) Add data (one by one) from a 5th Gaussian centered on the origin with the 
same covariance matrix as the other four Gaussians. For each new point, 
recompute the number of clusters using the method of Problem 6.24. 
Plot the number of clusters versus £ the amount of data added. 


(c) Repeat (b) and plot, as a function of £, the average increase in the number 
of clusters from adding £ data points. 


(d) From your plot in (c), estimate £*, the number of data points needed to 
identify the existence of this new cluster. 


(e) Vary o (the variance parameter of the Gaussians) and plot £“ versus ø. 
Explain your results. 


Problem 6.26 Let V = {v1,...,var} be a universe of objects. Define 
the distance between two sets $1, S52 C V by 


d(S1, S2) = 1 — J(S1, S2), 


where J(S1, S2) = |S1 O S2|/|S1 U S2| is the Jaccard coefficient. Show that 
d(-,-) is a metric satisfying non-negativity, symmetry and the triangle inequality. 


Problem 6.27 Consider maximum likelihood estimation of the parameters 
of a GMM in one dimension from data x71,...,2N. The probability density is 


P(«) = 3 EE) . 


Wk 
exp | — 
2no?2 P 202 


(a) If K = 1 what are the maximum likelihood estimate of pı and of. 





(b) If K > 1, show that the maximum likelihood estimate is not well defined; 
specifically that the maximum of the likelihood function is infinite. 


(c) How does the E-M algorithm perform? Under what conditions does the 
E-M algorithm converge to a well defined estimator? 
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e-6. SIMILARITY-BASED METHODS 6.5. PROBLEMS 


(d) To address the problem in (b), define the “regularized” likelihood: For 
each x», define the 2c-interval Bn = [an — €, £n + €]. The eregularized 
likelihood is the probability that each x, € Bn. Intuitively, one does not 
measure £n, but rather a 2e-interval in which æn lies. 


(i) Show that 


K 
Plen € Br] = Sou (Av (==) _ Fy (==) 
kel Ok Ok 


where Fw(-) is the standard normal distribution function. 


(ii) For fixed e, show that the maximum likelihood estimator is now well 
defined. What about in the limit as € + 0? 


(iii) Give an E-M algorithm to maximize the e-regularized likelihood. 


Problem 6.28 Probability density estimation is a very general task in 
that supervised learning can be posed as probability density estimation. In 
supervised learning, the task is to learn f(x) from (x1, y1), ---, (XN, YN). 


(a) Let P(x,y) be the joint distribution of (x,y). For the squared error, 
Eout(h) = Epcx,y)[(h(x) — y)?]. Show that among all functions, the one 
which minimizes Fout is f(x) = Epcx,yy[y|x]. 











(b) To estimate f, first estimate P(x,y) and then compute E[y|x]. Treat 
the N data points as N “unsupervised” points z1,...,zn in R*+, where 
Zn = [Xn; Yn]. Suppose that you use a GMM to estimate P(z), so the 
parameters wz, Hk, Ux have been estimated (wx > 0, >>, we = 1), and 





Fea 3 walSel? -b ewe) "Se (eH) 
= L nD? © 


k=1 
where Sz; = J,- Let Sk = [i a) and uk = ea Show that 
k Ck Br 


K T: 
> Ûpe BOR) MK (OK) (By me bi (x —ar)) 


g(x) = Ely|x] = — 
D Ûpel Fk) Me (XO) 
k=1 


where Qk = Ax — bebz/ce and We = wp y/|Sk|/Ck. Interpret this func- 
tional form in terms of Radial Basis Functions. 


(c) If you non-parametric Parzen windows with spherical Gaussian kernel to 
estimate P(x, y), show that g(x) is an RBF, 


g(x) = Ely|x] = 5 


[Hint: This is a special case of the previous part: what are K, wr, uk, Sk ?] 
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